Bandit Project

Abstract

In probability theory, the multi-armed bandit problem (sometimes called the K- or N-armed bandit problem) is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice. This is a classic reinforcement learning problem that exemplifies the exploration–exploitation tradeoff dilemma. The name comes from imagining a gambler at a row of slot machines (sometimes known as "one-armed bandits"), who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine. The multi-armed bandit problem also falls into the broad category of stochastic scheduling.[3]

In the problem, each machine provides a random reward from a probability distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other machines.[3]
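As a minimal illustration of this tradeoff, an epsilon-greedy gambler can be simulated on Bernoulli arms. The arm probabilities and the epsilon value below are hypothetical, chosen only to sketch the setting:

```python
import random

def epsilon_greedy(probs, pulls=10000, epsilon=0.1):
    """Simulate an epsilon-greedy gambler on Bernoulli arms."""
    counts = [0] * len(probs)     # times each arm was pulled
    values = [0.0] * len(probs)   # running mean reward per arm
    total = 0.0
    for _ in range(pulls):
        if random.random() < epsilon:                  # explore a random arm
            arm = random.randrange(len(probs))
        else:                                          # exploit the best arm so far
            arm = max(range(len(probs)), key=lambda a: values[a])
        reward = 1.0 if random.random() < probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return total

random.seed(0)
# Three arms with hidden success probabilities; the gambler must learn them.
print(epsilon_greedy([0.2, 0.5, 0.8]))
```

With enough pulls, the running total approaches the best arm's payoff rate discounted by the fraction of pulls spent exploring.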

Contents

Settings

imports

Global background

Useful tools

Main Codes

Result

greedy

Test Code

Trying more epsilon to find the best one

Code to visualize the data

Result evaluation

UCB

Test Code

Trying more c to find the best one

Code to visualize the data

Result evaluation

TS

Test Code

Code to visualize the data

Result evaluation

Comparison and Understanding

Dependent Case

Case 1: dependent on the other two arms

Take a certain case as an example:
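The project's own joint distribution is not reproduced here, but a hypothetical one shows the idea: even when arm C's reward depends on arms A and B, each arm still has a well-defined marginal probability that the algorithms can learn.

```python
# Hypothetical joint distribution over the three arms' rewards (A, B, C),
# chosen only to illustrate the dependent case -- not the project's numbers.
# joint[(a, b, c)] = P(A = a, B = b, C = c); arm C depends on the other two.
joint = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.15,
    (1, 1, 0): 0.10, (1, 1, 1): 0.20,
}
assert abs(sum(joint.values()) - 1.0) < 1e-12  # must be a valid distribution

def marginal(index):
    """Marginal success probability of one arm under the joint distribution."""
    return sum(p for outcome, p in joint.items() if outcome[index] == 1)

print([round(marginal(i), 2) for i in range(3)])  # -> [0.5, 0.55, 0.55]
```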

test for dependent case 1

Case 2: dependent on the last pull

We use a prime (A', B', C') to denote the last return value of each arm. We then fix a specific setting: given A = B = C = 1 at pull 0, we have the following probabilities
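The project's actual conditional probabilities are not listed here; a hypothetical table of P(arm = 1 | last return) illustrates how this case can be simulated:

```python
import random

# Hypothetical conditional probabilities for the "dependent on the last pull"
# case: each arm's success probability depends on its own previous return
# value (A', B', C'). The numbers are illustrative, not the project's.
p_given_last = {
    "A": {0: 0.3, 1: 0.7},   # P(A = 1 | A' = a')
    "B": {0: 0.4, 1: 0.6},
    "C": {0: 0.5, 1: 0.5},
}

def simulate(arm_order, init=1, seed=0):
    """Pull arms in the given order, starting from A' = B' = C' = init."""
    rng = random.Random(seed)
    last = {arm: init for arm in p_given_last}   # pull-0 state: A = B = C = 1
    rewards = []
    for arm in arm_order:
        r = 1 if rng.random() < p_given_last[arm][last[arm]] else 0
        last[arm] = r                            # new return becomes the prime value
        rewards.append(r)
    return rewards

print(simulate(["A", "B", "C"] * 5))
```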

test for dependent case 2

Evaluation and analysis of the dependent-case results

No matter which dependent case we set up, we can always calculate the marginal probability of each arm. Dependence therefore has no impact on how the algorithms operate: after a certain amount of exploration, each algorithm approaches the true marginal probabilities, so treating the arms as independent has no effect on the expected results.

That is, running a dependent case is equivalent to replacing the arm probabilities given in the problem with their marginals and then simulating as in the independent case. The results obtained by each algorithm are consistent with this discussion.

Moreover, in practice we will not know whether the arms are independent, so simulating the dependent case also brings the experiment closer to reality.

With Constraints

It is assumed that each pull of an arm costs m dollars. This makes it possible to go bankrupt through too much exploration during the experiment, losing any further chance to exploit. We therefore need a criterion for when it is appropriate to explore and when it is appropriate to exploit.

As an extreme case, participants are given 50 dollars at the beginning of the experiment, and each pull of the arm costs 0.775 dollars. The main code is updated as follows:
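The project's own updated code is not reproduced here; a minimal sketch of a budget-constrained run, assuming Bernoulli arms with a $1 unit reward and an epsilon-greedy player, could look like this:

```python
import random

def play_with_budget(probs, budget=50.0, cost=0.775, epsilon=0.1, seed=0):
    """Epsilon-greedy play that stops when the budget can no longer cover a pull.

    Returns (pulls made, remaining budget, whether the player went bankrupt).
    Arm probabilities and the $1 reward are illustrative assumptions."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    values = [0.0] * len(probs)
    pulls = 0
    while budget >= cost and pulls < 10000:     # stop before going negative
        budget -= cost                          # pay for the pull up front
        if rng.random() < epsilon:
            arm = rng.randrange(len(probs))
        else:
            arm = max(range(len(probs)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        budget += reward
        pulls += 1
    bankrupt = budget < cost and pulls < 10000  # stopped for lack of money
    return pulls, budget, bankrupt

print(play_with_budget([0.2, 0.5, 0.8]))
```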

Bankrupt Examples

Taking several specific algorithms and parameter settings as examples, we can see that once the cost of pulling an arm is added, each algorithm has some probability of going bankrupt.

Visualization

Update the algorithm

Based on the above discussion, we look for a parameter to measure our resistance to bankruptcy.

For the greedy algorithm, when money is low we should exploit more and explore less; that is, epsilon should decrease as money decreases.

Similarly, for the UCB algorithm, the value of c should decrease as money decreases. In both algorithms we therefore try scaling the parameters in two ways, linearly and exponentially, giving the following formulas:

$$flag_1 = \frac{money}{expected\ reward} \qquad flag_2 = 1 - e^{-x \cdot money}$$

$$\epsilon' = \epsilon \cdot flag \qquad c' = c \cdot flag$$

where flag is flag_1 in the linear case and flag_2 in the exponential case; as an example, we take the expected reward as 200 and x as 0.001.
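These update rules are straightforward to compute. The sketch below uses the example values from the text (expected reward 200, x = 0.001); the base epsilon and c are illustrative:

```python
import math

def linear_flag(money, expected_reward=200.0):
    """flag_1 = money / expected_reward."""
    return money / expected_reward

def exponential_flag(money, x=0.001):
    """flag_2 = 1 - exp(-x * money)."""
    return 1.0 - math.exp(-x * money)

def adjusted_params(money, epsilon=0.1, c=2.0, flag_fn=linear_flag):
    """epsilon' = epsilon * flag and c' = c * flag: explore less as money drops."""
    flag = flag_fn(money)
    return epsilon * flag, c * flag

# With $50 left, the linear rule shrinks epsilon from 0.1 to 0.025.
print(adjusted_params(50.0))
```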

For the TS algorithm, we should adjust the prior probability before each experiment according to the amount of money, so that arms with high probability account for a larger share of the pulls. However, unless the prior deviates very badly, the TS algorithm will most likely choose the best arm anyway, making the probability of bankruptcy very low. When the prior deviation is large, we should not limit ourselves to adjusting the TS algorithm, but should instead switch to a new algorithm that ignores this deviation.
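To make this concrete, here is a minimal Beta-Bernoulli Thompson Sampling step, with one possible money-dependent prior adjustment plugged in. The sharpening rule, the threshold of 200, and the pseudo-count scaling are all illustrative assumptions, not the project's actual method:

```python
import random

def ts_choose(a, b, rng):
    """Sample a success probability for each arm from Beta(a_i, b_i)
    and pull the arm with the largest sample."""
    samples = [rng.betavariate(a_i, b_i) for a_i, b_i in zip(a, b)]
    return max(range(len(samples)), key=lambda i: samples[i])

def sharpen_prior(a, b, money, expected_reward=200.0):
    """One possible money-dependent adjustment: when money is low, scale the
    pseudo-counts up, keeping each arm's mean but shrinking its variance, so
    the currently-best arm is chosen more deterministically."""
    k = 1.0 + max(0.0, (expected_reward - money) / expected_reward)
    return [a_i * k for a_i in a], [b_i * k for b_i in b]

rng = random.Random(0)
# An arm with a strongly favorable posterior is almost always chosen.
print(ts_choose([50, 1, 1], [1, 50, 50], rng))
```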

Linear Way to update the algorithm

Test Code

Exponential way to update the algorithm

Test Code

Visualize

Evaluation

It can be seen from the figure that the different update methods have different effects depending on the parameters and the algorithm. When facing costly arms, this forces us to set a different resistance to bankruptcy for each algorithm in order to obtain a better reward.

References

[1] Reinforcement Learning: An Introduction (second edition), R. Sutton & A. Barto, 2018.

[2] Project-Slides (SI140 Fall 2020), Ziyu Shao (ShanghaiTech), 2020.

[3] https://en.wikipedia.org/wiki/Multi-armed_bandit